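The Conductor: The Compiler Driver
Think of the compiler driver (such as GCC) as an orchestra conductor. It automates the translation from human-readable source code to a binary executable. This journey, the path to execution, begins at compile time and continues through load time and run time.
With separate compilation, the driver processes main.c and sum.c independently. A change to one module does not require retranslating the whole project: only the modified file passes through the preprocessor (cpp), the compiler (cc1), and the assembler (as), after which the linker (ld) combines the resulting relocatable object files.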
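The rebuild logic described above can be sketched as a small planner. This is an illustration only: it prints commands rather than running them, it uses the standard `gcc -E`, `-S`, and `-c` options as stand-ins for the cpp, cc1, and as stages (the driver's real internal invocations and temporary file names differ), and the output name `prog` and the helper names are assumptions made for this example.

```python
# Sketch: which commands are needed when only some source files changed.
# main.c and sum.c come from the text; "prog" is a hypothetical output name.

def translate_commands(source):
    """Commands that take one .c file to a relocatable object file."""
    stem = source.rsplit(".", 1)[0]
    return [
        f"gcc -E {source} -o {stem}.i",  # preprocess (cpp stage)
        f"gcc -S {stem}.i -o {stem}.s",  # compile to assembly (cc1 stage)
        f"gcc -c {stem}.s -o {stem}.o",  # assemble (as stage)
    ]

def link_command(sources, output="prog"):
    """Single link step that combines every object file (ld stage)."""
    objects = " ".join(s.rsplit(".", 1)[0] + ".o" for s in sources)
    return f"gcc {objects} -o {output}"

def rebuild_plan(sources, changed):
    """Retranslate only the changed files, then relink everything."""
    plan = []
    for src in sources:
        if src in changed:
            plan.extend(translate_commands(src))
    plan.append(link_command(sources))
    return plan

for cmd in rebuild_plan(["main.c", "sum.c"], changed={"sum.c"}):
    print(cmd)
```

With only sum.c marked as changed, the plan contains the three translation steps for sum.c plus one link step; main.o is reused as it is.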
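Efficiency and the Memory Hierarchy
Where the linker places arrays such as grid or src (that is, the address of grid[0][0] or src[0][0]) affects throughput and latency. When data is aligned to a 32-byte cache line and traversed with a stride-1 access pattern, each cold miss loads a line whose remaining elements then hit, and the cache line evictions caused by column-wise scans are avoided. In advanced high-performance code, loop unrolling ($4 \times 4$ unrolling) exposes more parallelism and helps hide the latency of moving data from main memory into the cache. The values 0x32, 0x1, 0x4, and 0x51 illustrate how the cycle counts of memory accesses can vary.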
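To make the alignment point concrete, the following sketch computes which 32-byte cache lines a value occupies and whether it crosses a line boundary. The function names and the two sample addresses (0x0640 and 0x065E) are assumptions chosen for illustration; only the 32-byte line size comes from the text.

```python
LINE_SIZE = 32  # bytes per cache line, as stated in the text

def lines_touched(address, size, line_size=LINE_SIZE):
    """Return the cache line numbers covered by `size` bytes at `address`."""
    first = address // line_size
    last = (address + size - 1) // line_size
    return list(range(first, last + 1))

def straddles(address, size, line_size=LINE_SIZE):
    """True if the value spans more than one cache line."""
    return len(lines_touched(address, size, line_size)) > 1

# Hypothetical placements of a 4-byte integer.
print(straddles(0x0640, 4))  # 0x0640 is a multiple of 32
print(straddles(0x065E, 4))  # offset 30 within its line
```

The first placement stays inside one line, while the second covers bytes 30 through 33 of its line and so touches two lines, which means two cache fetches for a single load.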
QUESTION 1
Which component of the compiler driver is responsible for generating the assembly file (/tmp/main.s)?
The preprocessor (cpp)
The compiler (cc1)
The assembler (as)
The linker (ld)
✅ Correct!
Correct! cc1 translates the preprocessed C code into assembly code.
❌ Incorrect
The preprocessor only handles macros and headers. cc1 is the stage that produces assembly.
QUESTION 2
What is a primary benefit of 'Separate Compilation'?
It makes the final executable run faster.
It allows modifications to one file without re-translating others.
It automatically unrolls all loops to 4x4.
It eliminates the need for a linker.
✅ Correct!
Indeed. This modularity is essential for large projects to maintain manageable build times.
❌ Incorrect
Separate compilation focuses on translation efficiency, not execution speed directly.
QUESTION 3
How does a Stride-1 reference pattern affect the L1 cache?
It causes column-wise scan evictions.
It maximizes hit rates by utilizing spatial locality.
It bypasses the cache to reduce latency.
It increases the number of cold misses to 100%.
✅ Correct!
Correct. Accessing memory sequentially ensures that once a 32-byte cache line is loaded, subsequent nearby data is already in the cache.
❌ Incorrect
Stride-1 actually minimizes evictions compared to large strides or column-wise scans.
QUESTION 4
What happens at 0x064C if the linker places a multi-byte integer across a 32-byte cache boundary?
The compiler driver automatically fixes it at run time.
The L1 cache throughput is maximized.
A potential drop in hit rates and increased latency occurs.
The assembler produces a relocatable error.
✅ Correct!
Unaligned data or data spanning boundaries requires multiple cache fetches, hurting performance.
❌ Incorrect
This is a performance issue that neither the driver nor assembler automatically 'fixes' without specific alignment directives.
QUESTION 5
The hex representations 0x32, 0x1, 0x4, and 0x51 in the theory likely represent:
The binary tags for the L2 cache.
Clock frequency stalls or memory fetch latencies.
The sequence of registers used in a 4x4 unroll.
The static library identifiers.
✅ Correct!
Correct. These illustrate the raw timing variations involved in memory access cycles.
❌ Incorrect
These values are used to describe performance characteristics during advanced optimization.
Case Study: Memory Hierarchy & Hit Rates
Applying Figure 6.48 logic to cache performance
You are analyzing the performance of a program that transposes a matrix using two arrays: src and dst. Both are stored at addresses such as 0x064C, 0x064D, 0x064E, and 0x064F. The system uses a 32-byte cache line. Determine how the placement chosen at link time and the memory access pattern interact.
Q
Based on Figure 6.48, what is the hit rate for the dst and src arrays when the cache block size is 32 bytes and the cache is large enough to hold both arrays?
Solution:
Assuming 4-byte integers and a 32-byte cache block, each cache line holds 8 integers (32 / 4 = 8). If the cache is large enough to hold both arrays, conflict and capacity misses are eliminated, leaving only cold misses. For a stride-1 access pattern (reading row by row), the first access to a block is a miss and the following 7 accesses are hits. Because no line is ever evicted, the same count holds for the array that the transpose walks column by column, assuming each array starts on a line boundary: every line still misses exactly once. Therefore, the hit rate for both arrays is 7/8, or 87.5%.
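The 7/8 figure can be checked with a minimal simulation of a cache that never evicts, so only cold misses occur. This is a sketch under stated assumptions: the 16 x 16 matrix size, the base address of 0, and all function names are chosen for illustration and are not taken from Figure 6.48.

```python
LINE_SIZE = 32  # bytes per cache block, from the case study
INT_SIZE = 4    # bytes per integer, from the solution

def hit_rate(addresses, line_size=LINE_SIZE):
    """Hit rate for a cache large enough that no line is ever evicted."""
    loaded = set()  # line numbers already brought into the cache
    hits = 0
    total = 0
    for addr in addresses:
        line = addr // line_size
        if line in loaded:
            hits += 1
        else:
            loaded.add(line)  # cold miss: first touch of this line
        total += 1
    return hits / total

def row_major(n, base=0):
    """Addresses of an n x n int array read row by row (stride-1)."""
    return [base + (i * n + j) * INT_SIZE for i in range(n) for j in range(n)]

def column_major(n, base=0):
    """Addresses of the same array read column by column."""
    return [base + (i * n + j) * INT_SIZE for j in range(n) for i in range(n)]

N = 16  # hypothetical matrix dimension
print(hit_rate(row_major(N)))
print(hit_rate(column_major(N)))
```

Both traversals report 0.875 here, since without evictions each of the 32 lines is missed once and hit seven times regardless of visiting order.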
Q
If the code uses a column-wise scan instead of Stride-1, how does this affect the 'Cache tag' and 'Cache set index' usage?
Solution:
A column-wise scan increases the stride between consecutive accesses. Each access then lands in a different cache line, so it either maps to a different cache set index or maps to the same set with a different tag. If the matrix is larger than the cache, lines are evicted before their remaining elements can be reused, and the hit rate can fall to 0% (all misses).
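The effect of the scan direction on the set index and tag can be illustrated with a small direct-mapped cache model. The parameters are assumptions picked so that the matrix is much larger than the cache: a 64 x 64 integer matrix (16 KB) against 32 sets of 32-byte lines (1 KB). Direct mapping itself is also an assumption, since the case study does not state the associativity.

```python
LINE_SIZE = 32  # bytes per cache line, from the case study
INT_SIZE = 4    # bytes per integer

def direct_mapped_hit_rate(addresses, num_sets, line_size=LINE_SIZE):
    """Simulate a direct-mapped cache and return its hit rate."""
    tags = [None] * num_sets  # tag currently stored in each set
    hits = 0
    total = 0
    for addr in addresses:
        line = addr // line_size
        index = line % num_sets   # cache set index
        tag = line // num_sets    # cache tag
        if tags[index] == tag:
            hits += 1
        else:
            tags[index] = tag     # miss: replace whatever was in the set
        total += 1
    return hits / total

def scan(n, column_wise):
    """Addresses of an n x n int array at base 0, in the chosen order."""
    if column_wise:
        return [(i * n + j) * INT_SIZE for j in range(n) for i in range(n)]
    return [(i * n + j) * INT_SIZE for i in range(n) for j in range(n)]

N = 64         # hypothetical matrix dimension
NUM_SETS = 32  # hypothetical cache: 32 sets x 32 bytes = 1 KB
print(direct_mapped_hit_rate(scan(N, column_wise=False), NUM_SETS))
print(direct_mapped_hit_rate(scan(N, column_wise=True), NUM_SETS))
```

With these parameters the row-wise scan keeps the 7/8 hit rate, while the column-wise scan cycles through only four sets per column with a new tag each time, so every line is evicted before its next element is read.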